Rethinking Spatiotemporal Feature Learning For Video Understanding

Authors

Saining Xie

Chen Sun

Jonathan Huang

Zhuowen Tu

Kevin Murphy

Abstract

In this paper we study 3D convolutional networks for video understanding tasks. Our starting point is the state-of-the-art I3D model of [3], which “inflates” all the 2D filters of the Inception architecture to 3D. We first consider “deflating” the I3D model at various levels to understand the role of 3D convolutions. Interestingly, we found that 3D convolutions at the top layers of the network contribute more than 3D convolutions at the bottom layers, while also being computationally more efficient. This indicates that I3D is better at capturing high-level temporal patterns than low-level motion signals. We also consider replacing 3D convolutions with spatiotemporal-separable 3D convolutions (i.e., replacing a convolution that uses a k_t×k×k filter with a 1×k×k filter followed by a k_t×1×1 filter); we show that such a model, which we call S3D, is 1.5x more computationally efficient (in terms of FLOPS) than I3D, and achieves better accuracy. Finally, we explore spatiotemporal feature gating on top of S3D. The resulting model, which we call S3D-G, outperforms the state-of-the-art I3D model by 3.5% accuracy on Kinetics and reduces the FLOPS by 34%. It also achieves new state-of-the-art performance when transferred to other action classification (UCF-101 and HMDB51) and detection (UCF-101 and JHMDB) datasets.
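The saving from the separable factorization described in the abstract can be illustrated with a small cost count for a single layer. The sketch below is not taken from the paper: the function names, the channel counts, the feature-map size, and the choice to give the intermediate 1×k×k output the same number of channels as the final output are all assumptions made for illustration. It assumes bias-free convolutions with stride 1 and "same" padding.

```python
def conv3d_cost(c_in, c_out, kt, kh, kw, t, h, w):
    """Weights and multiply-adds of one bias-free 3D convolution.

    Assumes stride 1 and 'same' padding, so the output has t*h*w positions.
    """
    weights = c_in * c_out * kt * kh * kw
    macs = weights * t * h * w
    return weights, macs


def separable_cost(c_in, c_out, kt, k, t, h, w):
    """Cost of a 1 x k x k convolution followed by a kt x 1 x 1 convolution.

    Assumption: the intermediate feature map has c_out channels.
    """
    spatial_w, spatial_m = conv3d_cost(c_in, c_out, 1, k, k, t, h, w)
    temporal_w, temporal_m = conv3d_cost(c_out, c_out, kt, 1, 1, t, h, w)
    return spatial_w + temporal_w, spatial_m + temporal_m


# Hypothetical layer: 64 -> 64 channels, 3x3x3 kernel, 16 x 56 x 56 output.
full = conv3d_cost(64, 64, 3, 3, 3, 16, 56, 56)
sep = separable_cost(64, 64, 3, 3, 16, 56, 56)

print("full 3D   :", full)
print("separable :", sep)
print("weight ratio:", full[0] / sep[0])
```

For this hypothetical layer the full 3×3×3 convolution has 110,592 weights and the separable pair has 49,152, a ratio of 2.25. That per-layer figure does not reproduce the 1.5x whole-network number quoted in the abstract, which depends on the complete architecture.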

Similar resources

ConvNet Architecture Search for Spatiotemporal Feature Learning

Learning image representations with ConvNets by pretraining on ImageNet has proven useful across many visual understanding tasks including object detection, semantic segmentation, and image captioning. Although any image representation can be applied to video frames, a dedicated spatiotemporal representation is still vital in order to incorporate motion patterns that cannot be captured by appea...

Learning spatiotemporal features by using independent component analysis with application to facial expression recognition

Engineered features have been heavily employed in computer vision. Recently, feature learning from unlabeled data for improving the performance of a given vision task has received increasing attention in both machine learning and computer vision. In this paper, we present using unlabeled video data to learn spatiotemporal features for video classification tasks. Specifically, we employ independ...

Learning semantic features for action recognition via diffusion maps

Efficient modeling of actions is critical for recognizing human actions. Recently, bag of video words (BoVW) representation, in which features computed around spatiotemporal interest points are quantized into video words based on their appearance similarity, has been widely and successfully explored. The performance of this representation, however, is highly sensitive to two main factors: the gr...

The Effect of Family Nursing Education Using Reflection Method with the Help of Situation Simulation Through Video Screening on Learning and Perspective of Nursing Students

Introduction: Reflection is one of the basic methods of education that is effective in raising the level of awareness and skills in clinical situations. The aim of this study was to investigate the effect of family nursing education using reflection method with the help of situation simulation through video screening on learning and perspective of nursing students. Methods: This quasi-experimen...

A Conversation Analytic Study on the Teachers’ Management of Understanding-Check Question Sequences in EFL Classrooms

Teacher questions are claimed to be constitutive of classroom interaction because of their crucial role both in the construction of knowledge and the organization of classroom proceedings (Dalton Puffer, 2007). Most of previous research on teachers’ questions mainly focused on identifying and discovering different question types believed to be helpful in creating the opportunities for learners’...

Journal title:

CoRR

Volume abs/1712.04851, issue

Pages -

Publication date: 2017
